Abstract

While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as “pursuit” or “romance” remains challenging.

We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises. We define 11 non-exclusive semantic categories, called beat-categories, that are broad enough to cover most of the movie footage. The corresponding beat-events are annotated as groups of video shots, possibly overlapping. We propose an approach for localizing beat-events based on classifying shots into beat-categories and learning the temporal constraints between shots. We show that temporal constraints significantly improve the classification performance. We set up an evaluation protocol for beat-event localization as well as for shot classification, depending on whether movies from the same franchise are present or not in the training data.


1. Introduction 

Automatic understanding and interpretation of videos is a challenging and important problem due to the massive increase of available video data, and the wealth of semantic variety of video content. Realistic videos include a wide variety of actions, activities, scene types, etc. During the last decade, significant progress has been made for action retrieval and recognition of specific, stylized, human actions. In particular, powerful visual features were proposed towards this goal [21, 22, 33]. For more general types of events in videos, such as activities, efficient approaches were proposed and benchmarked as part of the TrecVid Multimedia Event Detection (MED) competitions [23]. State-of-the-art approaches combine features from all modalities (text, visual, audio), static and motion features (possibly learned beforehand with deep learning), and appropriate fusion procedures.

In this work, we aim at detecting events of the same semantic level as TrecVid MED, but on real action movies that follow a structured scenario. From a movie scriptwriter’s point of view [31, 28], a Hollywood movie is more or less constrained to a set of standard storylines. This standardization helps match the audience's expectations and habits. However, movies need to be fresh and novel enough to fuel the interest of the audience, so some variability must be introduced in the storylines. Temporally, movies are subdivided into a hierarchy of acts, scenes, shots, and finally, frames (see Figure 2). Punctual changes in the storyline give it a rhythm. They are called “beats” and are common to many films. A typical example of a beat is the moment when an unexpected solution saves the hero.

Figure 1. Example frames for the categories from the Action Movie Franchises dataset.

Figure 2. Temporal structure of a movie, according to the taxonomy of “Save the Cat” [31], and our level of annotation, the beat-event.

From a computer vision point of view, frames are readily available and reliable algorithms for shot detection exist. Grouping shots into scenes is harder. Scenes are characterized by a uniform location, set of characters or storyline. The semantic level of beats and acts is out of reach. We propose here to attack the problem on an intermediate level by detecting “beat-events”. Temporally, they consist of sequences of consecutive shots and typically last a few minutes. Shots offer a suitable granularity, because movies are edited so that they follow the rhythm of the action. Semantically, beat-events are of a higher level than the actions in most current benchmarks, but lower than the beats, which are hard to identify even for a human.

For the purpose of research, we built an annotated dataset of Hollywood action movies, called Action Movie Franchises. It comprises 20 action movies from 5 franchises: Rambo, Rocky, Die Hard, Lethal Weapon, Indiana Jones. A movie franchise refers to a series of movies on the same “topic”, sharing similar storylines and the same characters. In each movie, we annotate shots into several non-exclusive beat-categories. We then create a higher level of annotation, called beat-events, which consists of consistent sequences of shots labeled with the same beat-category.

Figure 1 illustrates the beat-categories that we use in our dataset. They are targeted at action movies and, thus, rely on semantic categories that often depend on the role of the characters, such as hero (good) or villain (bad). We now briefly describe all categories. First, we define three different action-related beat-categories: pursuit, battle preparation and battle, shown in the first row of Fig. 1. We also define categories centered on the emotional state of the main characters: romance, despair good (e.g. when the hero thinks that all is lost) and joy bad (e.g. when the villain thinks he won the game), see second row of Fig. 1. We also include different categories of dialog between all combinations of good and bad characters: good argue good, good argue bad and bad argue bad (third row of Fig. 1). Finally, we add two more categories denoting a temporary victory of a good or bad character (victory good and victory bad, last row of Fig. 1). We also consider a NULL category, corresponding to shots that cannot be classified into any of the aforementioned beat-categories.

In summary, we introduce the Action Movie Franchises dataset, which features dense annotations of 11 beat-categories in 20 action movies at both shot and event levels. To the best of our knowledge, a comparable dense annotation of videos does not exist.

We expect the semantic level of our beat-categories to drive progress in action recognition towards new approaches based on human identity, pose, interaction and semantic audio features. State-of-the-art methods are clearly not sufficient for such categories. Action movies and related professionally produced content account for a major fraction of what people watch on a daily basis. There is a large potential for applications, such as access to video archives and movie databases, interactive television and automatic annotation for the visually impaired.

Furthermore, we define several evaluation protocols, to investigate the impact of franchise information (testing with or without previously seen movies from the same franchise) and the performance for both classification and localization tasks. We also propose an approach for classification of video shots into beat-categories based on a state-of-the-art pipeline for multimodal feature extraction, classification and fusion. Our approach for localizing beat-events uses a temporal structure inferred by a conditional random field (CRF) model learned from training data.

We will make the Action Movie Franchises dataset publicly available to the research community upon publication, to further advance video understanding.

2. Related work 

Related datasets. Table 1 summarizes recent state-of-the-art datasets for action or activity recognition. Our Action Movie Franchises dataset mainly differs from existing ones with respect to the event complexity and the density of annotations. Similarly to Coffee & Cigarettes and MediaEval Violent Scene Detection (VSD), our Action Movie Franchises dataset is built on professional movie footage. However, while the former datasets only target short and sparsely occurring events, we provide dense annotations of beat-events spanning larger time intervals. Our beat-categories are also of significantly higher semantic level than those in action recognition datasets like Coffee & Cigarettes, UCF [32] and HMDB [14]. A consequence is that our dataset remains very challenging for state-of-the-art algorithms, as shown later in the experiments. Events of a similar complexity can be found in TrecVid MED 2011-2014 [23], but our dataset includes precise temporally localized annotations.

Action detection in movies. Action detection (or action localization), that is, finding if and when a particular type of action was performed in long and unsegmented video streams, received a lot of attention in the last decade. The problem was considered in a variety of settings: from still images [27], from videos [9, 33], with or without weak supervision, etc. Most works focused on highly stylized human actions such as “open door” or “sit down”, which are typically temporally salient in the video stream.

Action or activity recognition can often be boosted using temporal reasoning on the sequence of atomic events that characterize the action, as well as the surrounding events that are likely to precede or follow the action/activity of interest. We shall only review here the “temporal context” information from surrounding events; the decomposition of actions or activities into sequences of atomic events [9] is beyond the scope of our paper. Early works along this line [29] proposed to group shots and organize groups into “semantic” scenes, each group belonging exclusively to only one scene. Results were evaluated subjectively and no user study was conducted.

| Task | Name | # classes | Example class | Annotation unit | # train units | Avg. unit duration | Duration annot | Duration NULL | Coverage |
|---|---|---|---|---|---|---|---|---|---|
| Classification | UCF 101 [32] | 101 | high jump | clip | 13320 | 7.21s | 26h39 | 0h | - |
| Classification | HMDB 51 [14] | 51 | brush hair | clip | 6763 | 3.7s | 6h59 | 0h | - |
| Classification | TrecVid MED 11 | 15 | birthday party | clip | 2650 | 2m54 | 128h | 315h | 29% |
| Classification | Action Movie Franchises | 11 | good argue bad | shot | 16864 | 5.4s | 25h29 | 15h42 | 57.1% |
| Localization | Coffee & Cigarettes | 2 | drinking | time interval | 191 | 2.2s | 7m12s | 3h26 | 3.3% |
| Localization | THUMOS detection 2014 | 20 | floor gymnastics | t.i. on clip | 3213 | 26.2s | 3h22 | 167h54 | 2.0% |
| Localization | MediaEval VSD [ ] | 10 | fighting | shot/segment | 3206 | 3.0s | 2h38 | 55h20 | 4.5% |
| Localization | Action Movie Franchises | 11 | good argue bad | beat-event | 2906 | 35.7s | 28h49 | 14h08 | 61.4% |

Table 1. Comparison of classification and localization datasets. annot = total duration of all annotated parts; NULL = duration of the non-annotated (NULL or background) footage; coverage = proportion of annotated video footage.

Several papers proposed to use movie (or TV series) scripts to leverage the temporal structure [ , 19]. In [19], movie scripts are used to obtain scene and action annotations. Retrieving and exploiting movie scripts can be tricky and time-consuming. In many cases, movie scripts are simply not available. Thus, we did not use movie scripts to build our dataset and do not consider this information for training and testing. However, we do use another modality, the audio track, in a systematic way, and perform fusion following state-of-the-art approaches in multimedia [17] and TrecVid competitions [23].

In [4], the authors structure a movie into a sequence of scenes, where each scene is organized into interlaced threads. An efficient dynamic programming algorithm for structure parsing is proposed. Experimental results on a dataset composed of TV series and a feature-length movie are provided. More recently, in [2], actors and their actions are detected simultaneously under weak supervision of movie scripts using discriminative clustering. Experimental results on 2 movies (Casablanca and American Beauty) are presented, for 3 actions (walking, open door and sit down). The approach improves person naming compared to previous methods. In this work, we do not use supervision from movie scripts to learn and uncover the temporal structure, but rather learn it directly using a conditional random field that takes SVM scores as input features. The proposed approach is more akin to [11], where joint segmentation and classification of human actions in video is performed on toy datasets [10].


3. Action Movie Franchises 

We first describe the Action Movie Franchises dataset and the annotation protocol. Then, we highlight some striking features in the structure of the movies observed during and after the annotation process. Finally, we propose an evaluation protocol for shot classification into beat-categories and for beat-event localization.

3.1. The movies 

The Action Movie Franchises dataset consists of 20 Hollywood action movies belonging to 5 famous franchises: Rambo, Rocky, Die Hard, Lethal Weapon, Indiana Jones. Each franchise comprises 4 movies; see Table 1 for summary statistics of the dataset.

Each movie is decomposed into a list of shots, extracted with a shot boundary detector [20, 25]. Each shot is tagged with zero, one or several labels corresponding to the 11 beat-categories (the label NULL is assigned to shots with zero labels). Note that the total footage for the dataset is 36.5 h, shorter than the total length in Table 1, because shots with multiple labels are counted once per label. All categories are shown in Figure 1.

Series of shots with the same category label are grouped together in beat-events if they all depict the same scene (i.e. same characters, same location, same action, etc.). Temporally, we also allow a beat-event to bridge gaps of a few unrelated shots. Beat-events belong to a single, non-NULL, beat-category.

The set of categories was inspired by the taxonomy of [31], and motivated by the presence of common narrative structures and beats in action movies. Indeed, category definitions strongly rely on a split of the characters into “good” and “bad” tags, which is typical in such movies. Each category thus involves a fixed combination of heroes and villains: both “good” and “bad” characters are present during battle and pursuit, but only “good” heroes are present in the case of good argue good.

Large intra-class variation is due to a number of factors: duration, intensity of action, objects and actors, and different scene locations, camera viewpoints and filming styles. For ambiguous cases we used the “difficult” tag.

Figure 3. Top: Beat-events annotated for the Action Movie Franchises dataset, one movie per line, plotted along the temporal axis. All the movies were scaled to the same length. Bottom: zoom on a movie extract showing the shot segmentation, the annotations and the beat-events. Best viewed onscreen.

3.2. Annotation protocol 

The annotation process was carried out in two passes by three researchers. Ambiguous cases were discussed and resulted in a clear annotation protocol. In the first pass, we manually annotated each shot with zero, one or several of the 11 beat-category labels. In the second pass, we annotated the beat-events by specifying their category, beginning and ending shots. We tolerated gaps of 1-2 unrelated shots for sufficiently consistent beat-events. Indeed, movies are often edited into sequences of interleaved shots from two events, e.g. between the main storyline and the “B” story.

Some annotations are labeled as “difficult” if they are semantically hard to detect, or ambiguous. For instance, in Indiana Jones 3, Indiana Jones engages in a romance with Dr. Elsa Schneider, who actually betrays him to the “bad guy”. Romance between Indiana Jones and Dr. Elsa Schneider is therefore ambiguous. We exclude these shots at training and evaluation time, as in the Pascal evaluation protocol [8].

Our beat-event annotations cover about 60% of the movie footage, which is much higher than comparable datasets, see Table 1. This shows that the vocabulary we chose is representative: the dataset is annotated densely.


3.3. Highlighting structure of action movies 

Figure 3 shows the sequence of category-label annotations for several movies. Some global trends are striking: victory good occurs at the end of movies; battle is most prevalent in the last quarter of movies; there is a pause in fast actions (battle, pursuit) around the middle of the movies. In movie script terms, this is the “midpoint” beat [31], where the hero is at a temporary high or low in the story. In terms of beat-event duration, joy bad and victory bad are short, while pursuit and romance are long. These trends can be learned by the temporal re-scoring to improve the shot classification results.

After careful analysis of the annotation, we find that battle, despair good and pursuit are the most prevalent beat-categories, with 4145, 3042 and 2416 instances respectively. Since it is a semantically high-level class, despair good is most often annotated as difficult. The co-occurrences of classes as annotations of the same shot follow predictable trends: battle co-occurs with pursuit, battle preparation, victory good and victory bad. Interestingly, romance is often found in combination with despair good. This is typical for movies of the “Dude with a problem” type [31], where the hero must prove himself.

Within each movie franchise, a shared structure may appear. For instance, in Rocky, the battle preparation occurs in the last quarter of the movie, and there is no pursuit.
Figure 4. The two types of split for evaluation. In addition to the
train/test splits, the training videos are also split in 4 sub-folds, that 
are used for cross-validation and CRF training purposes. 


3.4. Evaluation protocol 

In the following, we propose two types of train/test splits and two performance measures for our Action Movie Franchises dataset.

Data splits. We consider two different types of splits over the 20 movies; see Figure 4. They both come in 5 folds of 16 training movies and 4 test movies. All movies appear once as a test movie. In the “leave one franchise out” setting, all movies from a single franchise are used as a test set. In “leave 4 movies out”, a single movie from each franchise is used as test. This allows us to evaluate whether our classifiers are specific to a franchise or generalize well across franchises.
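The two split types can be sketched as follows. This is an illustrative construction, not the assignment of Figure 4: the `(franchise, episode)` identifiers and the function names are hypothetical. Since 5 folds of 4 test movies cannot each contain one movie from all 5 franchises, the sketch assumes that each “leave 4 movies out” fold draws its test movies from 4 different franchises.

```python
FRANCHISES = ["Rambo", "Rocky", "Die Hard", "Lethal Weapon", "Indiana Jones"]


def all_movies():
    """Hypothetical movie identifiers: (franchise, episode), episode in 0..3."""
    return [(f, e) for f in FRANCHISES for e in range(4)]


def leave_one_franchise_out():
    """Each fold tests on the 4 movies of one franchise, trains on the other 16."""
    movies = all_movies()
    folds = []
    for held_out in FRANCHISES:
        test = [m for m in movies if m[0] == held_out]
        train = [m for m in movies if m[0] != held_out]
        folds.append((train, test))
    return folds


def leave_4_movies_out():
    """Each fold tests on one movie from each of 4 different franchises.

    Assumed assignment (not taken from the paper): fold j skips franchise j
    and picks a different episode from every other franchise, so that each
    of the 20 movies is tested exactly once.
    """
    movies = all_movies()
    folds = []
    for j in range(len(FRANCHISES)):
        test = [(f, (j - i - 1) % 5) for i, f in enumerate(FRANCHISES) if i != j]
        train = [m for m in movies if m not in test]
        folds.append((train, test))
    return folds
```

In both cases every fold has 16 training and 4 test movies; only the second lets the classifier see other movies of the test franchises during training.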

Classification setting. In the classification setting, we evaluate the accuracy of beat-category prediction at the shot level. Since a shot can have several labels, we adopt the following evaluation procedure. For a given shot with n > 0 ground-truth labels (in general n = 1, but the number of labels can be up to 4), we retain the best n predicted beat-categories (out of 11, according to their confidence scores). Accuracy is then measured independently for each beat-category as the proportion of ground-truth shots which are correctly labeled. We finally average accuracies over all categories, and report the mean and the standard deviation over the 5 cross-validation splits.
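A minimal sketch of this accuracy measure is given below. The function name and array layout are our own choices, and the sketch assumes that shots without any ground-truth label (NULL shots) are simply skipped.

```python
import numpy as np


def per_category_accuracy(scores, labels):
    """Top-n accuracy per beat-category.

    scores: (num_shots, K) array of confidence scores.
    labels: (num_shots, K) boolean array of ground-truth labels.
    For a shot with n > 0 labels, the n best-scoring categories are kept.
    Returns a length-K array; categories without ground truth give NaN.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    num_categories = scores.shape[1]
    hits = np.zeros(num_categories)
    counts = np.zeros(num_categories)
    for s, y in zip(scores, labels):
        n = int(y.sum())
        if n == 0:
            continue  # assumption: NULL shots are not evaluated
        predicted = np.zeros(num_categories, dtype=bool)
        predicted[np.argsort(-s, kind="stable")[:n]] = True
        counts += y
        hits += y & predicted
    return np.where(counts > 0, hits / np.maximum(counts, 1), np.nan)
```

The mean accuracy of one fold is then `np.nanmean(per_category_accuracy(scores, labels))`.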

Localization setting. In the localization setting, we evaluate the temporal agreement between ground-truth and predicted beat-events for each beat-category. A detection, consisting of a temporal segment, a category label and a confidence score, is tagged positive if there exists a ground-truth beat-event with an intersection-over-union score [8] over 0.2. If the ground-truth beat-event is tagged as “difficult”, the detection counts neither as positive nor as negative. The performance is measured for each beat-category in terms of average precision (AP) over all beat-events in the test fold, and the different APs are averaged to a mAP measure.
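The matching rule can be illustrated with the sketch below, for one beat-category. Two points are assumptions not stated in the text: a ground-truth beat-event can be matched by at most one detection (later matches count as false positives, as in the Pascal protocol), and AP is computed without interpolation. Function names and tuple layouts are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two time intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, min_iou=0.2):
    """AP for one beat-category.

    detections: list of (start, end, score).
    ground_truth: list of (start, end, difficult).
    """
    num_positives = sum(1 for g in ground_truth if not g[2])
    matched = set()
    tp = fp = 0
    precisions = []
    for start, end, _ in sorted(detections, key=lambda d: -d[2]):
        best_iou, best_j = 0.0, None
        for j, (gs, ge, _) in enumerate(ground_truth):
            iou = temporal_iou((start, end), (gs, ge))
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou > min_iou:
            if ground_truth[best_j][2]:
                continue  # "difficult": neither positive nor negative
            if best_j not in matched:
                matched.add(best_j)
                tp += 1
                precisions.append(tp / (tp + fp))
                continue
        fp += 1
    return sum(precisions) / num_positives if num_positives else 0.0
```

The mAP is the mean of these per-category APs.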

4. Shot and beat-event classification 

The proposed approach consists of 4 stages. First, we compute high-dimensional shot descriptors for different visual and audio modalities, called channels. Then, we learn linear SVM classifiers for each channel. At the late fusion stage, we take the linear combination of the channel scores. Finally, predictions are refined by leveraging the temporal structure of the data and beat-events are localized.

4.1. Descriptors extraction 

For each shot from a movie, we extract different descriptors corresponding to different modalities. For this purpose, we use a state-of-the-art set of low-level descriptors [1, 21]. It includes still image, face, motion and audio descriptors:

Dense SIFT [18] descriptors are extracted every 30th frame. The SIFTs of a frame are aggregated into a Fisher vector of 256 mixture components, which is power- and L2-normalized [24]. The shot descriptor is the power- and L2-normalized average of the Fisher descriptors from its frames. The output descriptor has 34559 dimensions.

Convolutional neural net (CNN) descriptors are extracted from every 30th frame. We run the image through a CNN [13] trained on ImageNet 2012, using the activations from the first fully-connected layer as a description vector (FC6 in 4096 dimensions). The implementation is based on DeCAF [6] and its off-the-shelf pre-trained network.

Motion descriptors are extracted for each shot. We extract improved dense trajectory descriptors [33]. The 4 components of the descriptor (MBHx, MBHy, HoG, HoF) are aggregated into 4 Fisher vectors that are concatenated. The output is a 108544-D vector.

Audio descriptors are based on MFCC [26] extracted for 25 ms audio chunks with a step of 10 ms. They are enhanced by adding first and second order temporal derivatives. The MFCCs are aggregated into a shot descriptor using a Fisher aggregation, producing a 20223-D vector.

Face descriptors are obtained by first detecting faces in each frame using the Viola-Jones detector from OpenCV [3]. Following the approach from [7], we join the detections into face tracks using the KLT tracker, allowing us to recover some missed detections. Each facial region is then described with a Fisher vector of dense SIFTs [30] (16384 dimensions) which is power- and L2-normalized. Finally, we average-pool all face descriptors within a shot and normalize the result again to obtain the final shot descriptor.
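The normalization and pooling applied to the Fisher vectors can be sketched as below. The exponent `alpha = 0.5` (signed square root) is the common choice for power normalization and is an assumption here, since the text does not state its value; the function names are ours.

```python
import numpy as np


def power_l2_normalize(v, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by L2 normalization.

    alpha = 0.5 is an assumed value, not given in the text.
    """
    v = np.asarray(v, dtype=float)
    v = np.sign(v) * np.abs(v) ** alpha
    return v / max(np.linalg.norm(v), eps)


def shot_descriptor(frame_fishers):
    """Average the normalized per-frame vectors, then normalize the average."""
    frames = np.stack([power_l2_normalize(f) for f in frame_fishers])
    return power_l2_normalize(frames.mean(axis=0))
```

Normalizing again after averaging keeps all shot descriptors on the unit sphere, which is convenient for the linear SVMs used afterwards.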

Overall, each 2 hr movie is processed in 6 hr on a 16-core 
machine. We will make all descriptors publicly available. 
Figure 5. Proposed training approach for one fold. In a first stage, SVMs SVM1...SVM4 are trained by leaving one sub-fold out of the training set, and are evaluated on the left-out sub-fold. In a second stage, a CRF model is trained, taking the sub-fold SVM scores as inputs. We then use all the training videos to train the final SVM model (SVMtest). The final model outputs scores on the test fold, which are then refined by the CRF model. Note that each SVM training includes calibration using cross-validation.


4.2. Shot classification with SVMs 

We now detail the time-blind detection method, which scores each shot independently without leveraging temporal structure.

Per-channel training of SVMs. The 5 descriptor channels are input separately to the SVM training. For each channel and for each beat-category, we use all shots annotated as non-difficult as positive examples and all other shots (excluding difficult ones) as negatives to train a shot classifier. We use a linear SVM and cross-validate the C parameter, independently for each channel. We compute one classifier SVMtest per fold, and 4 additional classifiers SVM1...SVM4 corresponding to sub-folds, see Figure 5.

Late fusion of per-channel scores. The per-channel scores are combined linearly into a shot score. For one fold, the linear combination coefficients are estimated using the sub-fold scores. We use a random search over the 5D space of coefficients to find the one that maximizes the average precision over the sub-folds. This optimization is performed jointly over all classes (shared weights), which we found reduces the variability of the weights.
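A random search of this kind could look like the following sketch. The sampling scheme (uniform weights rescaled to sum to 1), the number of trials and the function names are assumptions made for illustration; the objective is the mean AP over categories with shared weights.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP of a ranked list; NaN if there is no positive."""
    labels = np.asarray(labels, dtype=bool)[np.argsort(-scores, kind="stable")]
    if not labels.any():
        return float("nan")
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float(precision[labels].mean())


def random_search_fusion(channel_scores, labels, n_trials=200, seed=0):
    """Search shared fusion weights maximizing the mean AP.

    channel_scores: (C, N, K) per-channel shot scores on held-out sub-folds.
    labels: (N, K) boolean ground truth.
    Returns (weights of shape (C,), best mean AP).
    """
    rng = np.random.default_rng(seed)
    num_channels = channel_scores.shape[0]

    def mean_ap(w):
        fused = np.tensordot(w, channel_scores, axes=1)  # (N, K)
        return np.nanmean([average_precision(fused[:, k], labels[:, k])
                           for k in range(labels.shape[1])])

    best_w = np.full(num_channels, 1.0 / num_channels)  # start from uniform
    best = mean_ap(best_w)
    for _ in range(n_trials):
        w = rng.random(num_channels)
        w /= w.sum()  # assumption: non-negative weights summing to 1
        value = mean_ap(w)
        if value > best:
            best, best_w = value, w
    return best_w, float(best)
```

Sharing one weight vector across all categories keeps the search space at C dimensions instead of C × K.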

4.3. Leveraging temporal structure 

We leverage the temporal structure to improve the performance of the time-blind detection/localization method, using a conditional random field (CRF) [15]. We consider a CRF that takes the SVM scores as inputs. The CRF relies on a linear chain model. Unary potentials correspond to votes for the shot labels, while binary potentials model the probability of the sequences.

We model a video with a linear-chain CRF. It consists of latent nodes $y_i \in \mathcal{Y}$, $i = 1, \ldots, n$, that correspond to shot labels. Similar to an HMM, each node $y_i$ has a corresponding input data point $x_i \in \mathbb{R}^d$. Variables $x_i$ are always observed, whereas $y_i$ are known only for training data. An input data point $x_i \in \mathbb{R}^d$ corresponds to the shot descriptor, which in our case is the 11-D vector of L2-normalized SVM scores for each beat-category. The goal is to infer probabilities of shot labels for the test video.

The CRF model for one video is defined as:

$$\log p(Y \mid X; \lambda, \mu) = \sum_{i=1}^{n} \lambda^\top f(y_i, X) + \sum_{i=1}^{n-1} \mu^\top g(y_i, y_{i+1}, X),$$

where the inputs are $X = \{x_1, \ldots, x_n\}$ and the outputs $Y = \{y_1, \ldots, y_n\}$. We use the following feature (in the CRF literature sense) functions $f$ and $g$:

$$f_k(y_i, X) = x_{i,k}\, \delta(y_i, k)$$

$$g_{k',k''}(y_i, y_{i+1}, X) = \delta(y_i, k')\, \delta(y_{i+1}, k'')$$

where $x_{i,k}$ is the classification score of shot $i$ for category $k$, and $\delta(x, y)$ is 1 when $x = y$ and 0 otherwise. Therefore, the log-likelihood becomes

$$\log p(Y \mid X; \lambda, \mu) = \sum_{k \in \mathcal{Y}} \sum_{i=1}^{n} \lambda_k\, x_{i,k}\, \delta(y_i, k) + \sum_{\substack{k', k'' \in \mathcal{Y} \\ (k', k'') \neq (c, c)}} \sum_{i=1}^{n-1} \mu_{k',k''}\, \delta(y_i, k')\, \delta(y_{i+1}, k'').$$

We take $x_i$ from SVM classifiers trained using cross-validation on the training data. The CRF is learned by minimizing the negative log-likelihood in order to estimate $\lambda$ and $\mu$.

At test time, the CRF inference outputs marginal conditional probabilities $p(y_i \mid X)$, $i = 1, \ldots, n$.
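For a linear-chain model, these marginals can be computed exactly with the forward-backward algorithm. The sketch below is a generic implementation under our own naming, not the authors' code: `lam` holds one unary weight per category, `mu` one weight per ordered pair of consecutive labels, and the scores are normalized by the partition function of the chain.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def crf_marginals(x, lam, mu):
    """Marginals p(y_i = k | X) of a linear-chain CRF.

    x:   (n, K) per-shot classification scores.
    lam: (K,) unary weights, one per category.
    mu:  (K, K) pairwise weights, mu[k1, k2] for label k1 followed by k2.
    """
    unary = np.asarray(x, dtype=float) * lam
    n, num_labels = unary.shape
    alpha = np.zeros((n, num_labels))  # forward log-messages
    beta = np.zeros((n, num_labels))   # backward log-messages
    alpha[0] = unary[0]
    for i in range(1, n):
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + mu, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(mu + (unary[i + 1] + beta[i + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)  # log partition function
    return np.exp(alpha + beta - log_z)
```

With all pairwise weights set to zero the shots decouple and each row reduces to a softmax of the unary scores, which gives a quick sanity check.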

4.4. Beat-event localization 

The final step consists in localizing instances of a beat-event in a movie, given confidence scores output by the CRF. To that aim, shots must be grouped into segments, and a score must be assigned to the segments. We create segments by joining consecutive shots for which the CRF confidence is above 30% of its maximum over the movie. The segment’s score is the average of these shot confidences.
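The grouping rule can be sketched in a few lines for one beat-category and one movie. The function name and the `(first_shot, last_shot, score)` output layout are our own choices.

```python
def localize_beat_events(confidences, rel_threshold=0.3):
    """Group consecutive shots whose confidence exceeds 30% of the movie maximum.

    confidences: per-shot CRF confidences for one category, in temporal order.
    Returns a list of (first_shot, last_shot, score), where score is the
    average confidence of the shots in the segment.
    """
    confidences = list(confidences)
    if not confidences:
        return []
    threshold = rel_threshold * max(confidences)
    segments = []
    start = None
    for i, c in enumerate(confidences):
        if c > threshold:
            if start is None:
                start = i  # open a new segment
        elif start is not None:
            shots = confidences[start:i]
            segments.append((start, i - 1, sum(shots) / len(shots)))
            start = None
    if start is not None:  # close a segment running to the end of the movie
        shots = confidences[start:]
        segments.append((start, len(confidences) - 1, sum(shots) / len(shots)))
    return segments
```

For example, `localize_beat_events([0.1, 0.5, 1.0, 0.2, 0.0, 0.4, 0.4])` yields two segments, covering shots 1-2 and shots 5-6.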

Note that the CRF produces smoother scores over time for events that occur at a slower rhythm, see Figure 7. For example, “good argue good” usually lasts longer than “joy bad”, because the villain is delighted for a short time only. The CRF smoothing modulates the length of estimated segments: smoother curves produce longer segments, as expected.
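| Setting | Method | pursuit | battle | romance | victory good | victory bad | battle preparation | despair good | joy bad | good argue bad | good argue good | bad argue bad | mean accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Leave 4 movies out | SIFT | 53.8 | 76.4 | 23.9 | 11.7 | 4.4 | 22.1 | 15.0 | 9.5 | 15.1 | 25.5 | 4.0 | 23.76 ± 5.26 |
| Leave 4 movies out | CNN | 66.4 | 60.0 | 16.6 | 6.0 | 2.4 | 9.4 | 21.7 | 6.6 | 17.7 | 30.2 | 4.7 | 21.96 ± 5.91 |
| Leave 4 movies out | dense trajectories | 58.5 | 85.2 | 38.0 | 12.7 | 6.2 | 28.0 | 19.5 | 11.6 | 18.8 | 40.4 | 1.8 | 29.15 ± 6.12 |
| Leave 4 movies out | MFCC | 28.1 | 56.3 | 4.5 | 17.7 | 36.2 | 3.8 | 35.4 | 15.6 | 17.3 | 26.5 | 0.0 | 21.95 ± 13.97 |
| Leave 4 movies out | Face descriptors | 47.9 | 58.1 | 8.6 | 12.7 | 11.4 | 17.3 | 9.3 | 3.2 | 6.2 | 22.3 | 4.7 | 18.35 ± 10.50 |
| Leave 4 movies out | linear score combination | 63.9 | 89.2 | 32.3 | 14.0 | 11.4 | 18.6 | 26.0 | 12.1 | 18.0 | 44.3 | 1.8 | 30.15 ± 6.72 |
| Leave 4 movies out | + CRF | 76.0 | 91.2 | 57.6 | 19.9 | 1.0 | 41.4 | 43.1 | 9.6 | 25.1 | 44.8 | 0.0 | 37.25 ± 9.94 |
| Leave 1 franchise out | linear score combination | 57.8 | 83.6 | 13.0 | 14.9 | 9.6 | 3.8 | 28.0 | 5.2 | 18.2 | 44.3 | 0.0 | 25.32 ± 7.40 |
| Leave 1 franchise out | + CRF | 75.4 | 87.4 | 31.3 | 15.8 | 0.0 | 12.7 | 33.4 | 5.7 | 23.2 | 43.7 | 0.0 | 29.89 ± 12.11 |

Table 2. Performance comparison (accuracy) for shot classification. Standard deviations are computed over folds.

| Setting | Method | pursuit | battle | romance | victory good | victory bad | battle preparation | despair good | joy bad | good argue bad | good argue good | bad argue bad | mean AP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Leave 4 movies out | CRF + thresholding | 34.6 | 38.9 | 22.6 | 14.6 | 4.4 | 26.7 | 6.4 | 4.6 | 12.2 | 16.9 | 0.6 | 16.59 ± 6.82 |
| Leave 1 franchise out | CRF + thresholding | 36.8 | 36.5 | 28.9 | 14.3 | 4.5 | 1.7 | 4.2 | 5.2 | 6.5 | 13.5 | 3.7 | 14.16 ± 6.84 |

Table 3. Performance comparison (average precision) for beat-event localization.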
Figure 7. Example of localization results, for several beat-categories and movies. For each plot, detected beat-events are indicated with
bold rectangles (green/gray/red indicate correct/ignored/wrong detections). Ground-truth (GT) annotations are indicated above (beat-events 
marked as difficult appear hatched), and likewise missed detections are highlighted in red. Most often, occurrences of the beat-events are 
rather straightforward to localize given the CRF scores. 


5. Experiments 

After validating the processing chain on a standard dataset, we report classification and localization performance.

5.1. Validation of the classification method 

To make sure that our descriptors and classification chain are reliable, we run them on the small Coffee & Cigarettes [16] dataset, and compare the results to the state-of-the-art method of Oneata et al. [21]. For this experiment, we score fixed-size segments and use their non-maximum suppression method NMS-RS-0. We obtain 65.5% mAP for the “drinking” action and 45.4% mAP for “smoking”, which is close to their performance (63.9% and 50.5% respectively).

5.2. Shot classification 

Table 2 shows the classification performance at the shot level on the two types of splits. The low-level descriptors that are most useful in this context are the dense trajectory descriptors. Compared to setups like TrecVid MED or THUMOS [23, 12], the relative performance of audio descriptors (MFCC) is high, overall the same as for e.g. CNN (21.95% versus 21.96% mean accuracy in the “leave 4 movies out” setting). This is because Hollywood action movies have well-controlled soundtracks that almost continuously play music: the rhythm and tone of the music indicate the theme of the action occurring on screen. Therefore, the MFCC audio descriptors convey high-level information that is relatively easy to detect automatically.

Figure 6. Confusion matrix for shot classification with SVM and linear score combination for the “leave 4 movies out” setting.

Figure 8. Sample faces corresponding to shots for which the face classifier (i.e. SVM trained on faces) scored much higher than the SIFT classifier (i.e. trained on full images). Similar facial expressions can be observed within each beat-category, which suggests that our face classifier learns to recognize human expressions to some extent.

The face descriptor can be seen as a variant of SIFT, restricted to facial regions. The face channel classifier outperforms SIFT in three categories (victory good, victory bad and bad argue bad). Upon inspection, however, we noticed that only a fraction of shots actually contain exploitable faces (e.g. frontal, non-blurred, unoccluded and large enough), which may explain the lower performance for other categories. The performance of the face channel classifier may be attributed to a rudimentary facial expression recognition property: the faces of heroes arguing with other good characters can be distinguished from the grin of the villain in joy bad; see Figure 8.

The 4 least ambiguous beat-categories (pursuit, battle, battle preparation and romance) are detected most reliably. They account for more than half of the annotated shots. The other categories are typically interactions between people, which are defined by identity and speech rather than motion or music. The confusion matrix in Figure 6 shows that verbal interactions like “good argue good” and “good argue bad” are often confused.

The “leave 4 movies out” setting obtains significantly better results than “leave 1 franchise out” (30.15% versus 25.32% mean accuracy for the linear score combination), meaning that having seen movies from a franchise makes it easier to recognize what is happening in a new movie of the franchise: Rambo does not fight in the same way as Rocky. Finally, the CRF makes it possible to leverage temporal structure using the temporally dense annotations, improving the classification performance by about 7 points in the “leave 4 movies out” setting and by about 4.6 points in the “leave 1 franchise out” setting.

5.3. Beat-event localization 

Table 3 gives results for beat-event localization. We observe that the performance is low for the least frequent actions. Indeed, in the “leave 1 franchise out” setting, the performance is below 15% AP for 8 out of 11 categories. Per-channel results are not provided due to lack of space, but their relative performance is similar to the classification ones. Figure 7 displays localization results for different beat-categories. Categories such as battle and pursuit are localized reliably. Semantic categories such as romance, victory good and good argue good are harder to detect. More advanced features could improve the results for these events. Indeed, recognition of characters, their pose and speech appears necessary.

6. Conclusion 

Despite the explosion of user-generated video content, people are still watching professionally produced videos most of the time. Therefore, the analysis of this kind of footage will remain an important task. In this context, Action Movie Franchises appears as a challenging benchmark. The annotated classes range from reasonably easy to recognize (battle) to very difficult and semantic (good argue bad). We also provide baseline results from a method that builds on state-of-the-art descriptors and classifiers. Therefore, we expect it to be a valuable test case in the coming years. We will provide the complete annotations and evaluation scripts upon publication.